This is an R markdown document intended to compare the performances of FROGS, MOTHUR, UPARSE and QIIME in terms of accuracy on both simulated and synthetic microbial communities.
The results of FROGS, MOTHUR, UPARSE and QIIME are compared using three different metrics: divergence, the number of false negatives (FN) and the number of false positives (FP).
The experimental design differed for the simulated communities (for which a full-factorial design was used) and the synthetic communities.
The simulated communities were built according to the following design:
This resulted in a total of 2 databanks \(\times\) 5 community sizes \(\times\) 2 abundance distributions \(\times\) 10 theoretical communities \(\times\) 10 replicates for each theoretical community \(\times\) 2 amplicons \(=\) 2000 samples (1000 per databank).
The experimental design used for the synthetic community was slightly different:
| nb_OTU | amplicon | abundance_law | count |
|---|---|---|---|
| 20sp | V3V4 | even | 4 |
| 20sp | V3V4 | staggered | 4 |
| 20sp | V4V4 | even | 1 |
| 20sp | V4V5 | even | 1 |
| 4sp | V3V4 | uneven | 10 |
Three samples corresponding to communities of size 20 with an even abundance distribution were used to compare the amplicons V3V4, V4V4 and V4V5 (1 per amplicon). 8 samples corresponding to communities of size 20 and amplicon V3V4 were used to compare the even and staggered abundance distributions (4 per distribution). Finally, 19 samples (community size 4, amplicon V3V4 and distribution even) were used to compare the accuracy of the different OTU picking methods.
For each of the three metrics (divergence, FN and FP) we performed a two-sided paired test, either parametric (paired t-test) or non-parametric (Wilcoxon signed rank test, sometimes called the paired Mann-Whitney test), to assess the difference in accuracy between FROGS and each of its competitors.
The tests were performed at the theoretical community level (dataset) using biological replicates (set_number) as replicates. We chose to compare the methods at this level because it is the finest one for which we have replication. Pooling different theoretical communities and/or abundance distributions to compare the methods at higher levels (e.g. community size \(\times\) amplicon) would blur the signal, as a method may outclass the others for even abundances but perform worse on different abundance distributions.
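As a rough illustration of the non-parametric test on one theoretical community, the sketch below computes an exact two-sided signed rank p-value by enumerating every sign pattern of the paired differences. It is written in Python with the standard library only and is not the code used for this analysis; the function name is hypothetical and the tie and zero handling (average ranks, zeros dropped) is one common convention, assumed here.

```python
from itertools import product


def signed_rank_exact_pvalue(x, y):
    """Exact two-sided signed rank p-value for paired samples x and y.

    Zero differences are dropped and tied absolute differences receive
    average ranks. Enumeration is only practical for small n
    (10 replicates give 2**10 = 1024 sign patterns).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Distance of the observed statistic from its null expectation.
    observed = abs(w_plus - total / 2)
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - total / 2) >= observed - 1e-12:
            extreme += 1
    return extreme / 2 ** n
```

Here `x` and `y` would hold the metric (for example divergence) of FROGS and of one competitor over the replicates of a single theoretical community, in the same replicate order.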
For each theoretical community, we declared FROGS better (resp. worse) than its competitor when the test was significant at the 0.05 level and FROGS had a lower (resp. higher) metric than its competitor. When the test was not significant, the methods were declared tied. Finally, we aggregated the results to count for each condition (community size \(\times\) abundance distribution \(\times\) amplicon) the number of theoretical communities favoring one or none of the methods.
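The decision rule and the aggregation step described above can be sketched as follows. This is a minimal Python illustration, not the analysis code: the function names, the tuple layout and the numeric values are all made up for the example, and "significant" is taken to mean a p-value strictly below 0.05.

```python
from collections import Counter


def verdict(p_value, mean_frogs, mean_competitor, alpha=0.05):
    """Classify FROGS against one competitor for one theoretical community."""
    if p_value >= alpha or mean_frogs == mean_competitor:
        return "tied"
    # Lower is better for divergence, FN and FP alike.
    return "better" if mean_frogs < mean_competitor else "worse"


def aggregate(results):
    """Count verdicts per condition.

    `results` holds one tuple per theoretical community:
    (condition, p_value, mean_frogs, mean_competitor), where condition is
    (community size, abundance distribution, amplicon).
    """
    counts = {}
    for condition, p_value, mean_frogs, mean_competitor in results:
        label = verdict(p_value, mean_frogs, mean_competitor)
        counts.setdefault(condition, Counter())[label] += 1
    return counts


# Made-up values, for illustration only.
example = [
    ((20, "uniform", "V3V4"), 0.003, 2.1, 4.8),
    ((20, "uniform", "V3V4"), 0.410, 3.0, 3.2),
    ((20, "uniform", "V3V4"), 0.020, 5.5, 4.0),
]
summary = aggregate(example)
```

Each condition then maps to a count of theoretical communities where FROGS is better, worse or tied, which is the quantity reported in the test summaries.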
Before presenting the statistical analysis per se, we first present the results graphically for each databank and for the synthetic communities.
The comparisons of divergence at the sample level in the scatterplots show that, on average, FROGS has comparable but slightly better performance than MOTHUR and UPARSE: most samples end up in the upper left corner (corresponding to the region “divergence FROGS < divergence competitor”) but not too far away from the first diagonal (grey line).
A more traditional representation using boxplots of the excess divergence of FROGS, with samples from all theoretical communities pooled together, confirms the results: FROGS has similar (compared to UPARSE and QIIME) or lower (compared to MOTHUR) divergence for the vast majority of samples. Note that the y-range was reduced from \([-51, 41]\) to \([-15, 3]\) in order to exclude outliers (1% of communities with low FROGS but very high MOTHUR divergence or high FROGS but low QIIME divergence) and zoom in on the boxplots. The only configuration where FROGS is consistently outperformed is complex communities (number of species > 200) with uniform abundances and sequenced on the V4V4 region. In that configuration, FROGS is outperformed by QIIME.
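For readers who want to reproduce the zoomed boxplots, the quantity plotted can be sketched as below. The sketch assumes that excess divergence is defined as the FROGS divergence minus the competitor divergence for the same sample, which is inferred from the sign of the outliers described above rather than stated explicitly; the function names and numbers are hypothetical.

```python
def excess_divergence(frogs, competitor):
    """Per-sample excess divergence of FROGS (assumed: FROGS - competitor)."""
    return [f - c for f, c in zip(frogs, competitor)]


def fraction_outside(values, low=-15.0, high=3.0):
    """Share of samples hidden when the y-range is cut to [low, high]."""
    hidden = sum(1 for v in values if v < low or v > high)
    return hidden / len(values)


# Made-up divergences (in %), for illustration only.
frogs = [2.0, 1.0, 30.0, 5.0]
competitor = [3.0, 40.0, 4.0, 5.0]
excess = excess_divergence(frogs, competitor)
hidden_share = fraction_outside(excess)
```

Under this convention, negative values mean that FROGS is closer to the expected community than its competitor.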
Finally, a focus on the accuracy of FROGS alone shows that divergence levels vary mostly between 0 and 15% and, as expected, are higher for fine classifications (Species) than for coarse ones (Phylum). Unsurprisingly, the V3V4 amplicon gives a less distorted view of communities than the V4V4.
We repeat the graphical exploration of the results with False Positive and False Negative OTUs. A first representation shows that use of the V3V4 amplicon leads to more false positives and fewer false negatives than the V4V4. The graphics also highlight the very large number of false positives inferred by MOTHUR and QIIME (up to 20 times more than the real community size).
A focus on FROGS and UPARSE leads to similar patterns: FROGS always produces fewer false negatives than UPARSE but produces slightly more false positives under power law abundance distributions and slightly fewer under uniform abundance distributions.
The lower number of false positives produced by UPARSE under power law abundances could be due to the abundance-based filters used in UPARSE.
We present the results of the paired tests, either parametric (t-test, top) or non-parametric (signed rank test, bottom). Both tests show that FROGS performs as well as or better than UPARSE and MOTHUR in most conditions. The only condition in which FROGS does worse than UPARSE is small community size (20). It also does better than QIIME in most settings, with the exception of large communities (size > 200) with uniform abundances studied using the V4V4 region.
The real strength of FROGS lies in its ability to give a more accurate view of large communities (size > 200) at fine scales (Species or Genus level).
The same paired tests as in the previous section reveal that FROGS strictly outperforms MOTHUR in terms of both FP and FN taxa. It also produces fewer FN than UPARSE and fewer FP than QIIME. Additionally, it produces fewer FP than UPARSE for uniform distributions and more for power law ones.
Overall, FROGS produces fewer FP and fewer FN than either UPARSE or MOTHUR for high community sizes (>200 for uniform distributions, >1000 for power law distributions) and fewer FP than QIIME at all sizes.
The comparisons of divergence at the sample level in the scatterplots show that, on average, FROGS has comparable but slightly better performance than MOTHUR and UPARSE: most samples end up in the upper left corner (corresponding to the region “divergence FROGS < divergence competitor”) but not too far away from the first diagonal (grey line).
A more traditional representation using boxplots of the excess divergence of FROGS, with samples from all theoretical communities pooled together, confirms the results: FROGS has similar (compared to UPARSE) or lower (compared to MOTHUR) divergence for the vast majority of samples. Note that the y-range was reduced from \([-85, 31]\) to \([-15, 3]\) in order to exclude outliers (4% of samples with excess divergence < -15 and 0.02% with excess divergence > 3) and zoom in on the boxplots. As expected, all methods perform quite similarly up to the Order level and the main differences appear at the Family and Genus levels, where MOTHUR, MOTHUR_SOP and QIIME_SOP produce much larger divergences than competing methods. The only configuration where FROGS is consistently outperformed is complex communities (number of species > 200) with uniform abundances and sequenced on the V4V4 region. In that configuration, FROGS is outperformed by QIIME.
Finally, a focus on the accuracy of FROGS alone shows that divergence levels vary between 0 and 10% and, as expected, are higher for fine classifications (Genus) than for coarse ones (Phylum). Unsurprisingly, the V3V4 amplicon gives a less distorted view of communities than the V4V4. Overall, FROGS recovers community compositions very well, except at the Genus level for complex communities (size > 200) with uniform abundances and sequenced using the V4V4 region.
We repeat the graphical exploration of the results with False Positive and False Negative OTUs. A first representation shows that use of the V3V4 amplicon leads to more false positives and fewer false negatives than the V4V4. The graphics also highlight the very large number of false positives inferred by MOTHUR (up to 20 times more than the real community size).
A focus on FROGS and UPARSE leads to similar patterns: FROGS always produces fewer false negatives than UPARSE but produces slightly more false positives under power law abundance distributions and slightly fewer under uniform abundance distributions.
The lower number of false positives produced by UPARSE under power law abundances could be due to the abundance-based filters used in UPARSE.
We present the results of the paired tests, either parametric (t-test, top) or non-parametric (signed rank test, bottom). Both tests show that FROGS performs as well as or better than UPARSE and MOTHUR in most conditions. The only condition in which FROGS does worse than UPARSE is small community size (20). It also does better than QIIME in most settings, with the exception of large communities (size > 200) with uniform abundances studied using the V4V4 region.
The real strength of FROGS lies in its ability to give a more accurate view of large communities (size > 200) at fine scales (Species or Genus level).
The same paired tests as in the previous section reveal that FROGS strictly outperforms MOTHUR in terms of both FP and FN taxa. It also produces fewer FN than UPARSE. Additionally, it produces fewer FP than UPARSE for uniform distributions and more for power law ones. Overall, FROGS produces fewer FP and fewer FN than either UPARSE or MOTHUR for high community sizes (>200 for uniform distributions, >1000 for power law distributions).
Due to the different design used for synthetic communities, we perform focused comparisons of the samples.
We study the amplicon effect on the even community with 20 species as it is the only one for which different amplicons were used. We represent the trends observed at different taxonomic levels. UPARSE seems to do better than FROGS and MOTHUR but there is only one sample per amplicon so we can’t assess the significance of that trend.
Note that all methods have high divergences compared to simulated datasets. This may reflect experimental limitations (sequencing and amplification bias, copy number variations, etc.) rather than the intrinsic complexity of the synthetic community and/or differences between the methods.
We study the abundance distribution effect on the community with 20 species sequenced with the V3V4 amplicon region. FROGS is better than MOTHUR and UPARSE on the staggered community and worse on the even one, although the base divergence is not very satisfying on either distribution.
As we have 4 replicates for this community, we can compare all methods using a paired t-test (there are not enough replicates for the non-parametric signed rank test to reach significance). The statistical analysis confirms that FROGS outperforms MOTHUR and UPARSE on communities with staggered abundances.
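The remark about the signed rank test can be checked with a few lines of arithmetic. With n non-zero paired differences, the most extreme outcome (all differences of the same sign) has an exact two-sided p-value of 2 / 2^n, so no result can fall below that bound. The sketch below, in Python with hypothetical helper names, assumes an exact test with no zero differences.

```python
def min_signed_rank_pvalue(n):
    """Smallest achievable exact two-sided signed rank p-value for n pairs.

    Assumes no zero differences: the 2**n sign patterns are equally likely
    under the null and only two of them (all positive, all negative) are
    the most extreme.
    """
    return 2 / 2 ** n


def min_replicates_for_significance(alpha=0.05):
    """Smallest n whose best-case p-value falls below alpha."""
    n = 1
    while min_signed_rank_pvalue(n) >= alpha:
        n += 1
    return n
```

With 4 replicates the bound is 0.125, which is above 0.05 whatever the data; under these assumptions at least 6 pairs are needed before the test can reach significance at the 0.05 level.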
Finally, we compare the three methods on a toy community with 20 species and even abundances. We have 19 replicates for this community, enough to use either the signed rank test or the paired t-test. Once again, the base divergence level of FROGS is not fantastic but is in line with the divergences obtained by its competitors.
The statistical analyses confirm the graphical diagnostic of the boxplot: FROGS is generally better than MOTHUR and tied with UPARSE, except at the Genus rank, where it outperforms both of them. It also performs worse than UPARSE at the Phylum rank (for the paired t-test).